{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<h2 align=\"center\">点击下列图标在线运行HanLP</h2>\n",
    "<div align=\"center\">\n",
    "\t<a href=\"https://colab.research.google.com/github/hankcs/HanLP/blob/doc-zh/plugins/hanlp_demo/hanlp_demo/zh/tok_mtl.ipynb\" target=\"_blank\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\t<a href=\"https://mybinder.org/v2/gh/hankcs/HanLP/doc-zh?filepath=plugins%2Fhanlp_demo%2Fhanlp_demo%2Fzh%2Ftok_mtl.ipynb\" target=\"_blank\"><img src=\"https://mybinder.org/badge_logo.svg\" alt=\"Open In Binder\"/></a>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WfGpInivS0fG"
   },
   "source": [
    "## 安装"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IYwV-UkNNzFp"
   },
   "source": [
    "无论是Windows、Linux还是macOS，HanLP的安装只需一句话搞定："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "1Uf_u7ddMhUt"
   },
   "outputs": [],
   "source": [
    "!pip install hanlp -U"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pp-1KqEOOJ4t"
   },
   "source": [
    "## 加载模型\n",
    "HanLP的工作流程是先加载模型，模型的标示符存储在`hanlp.pretrained`这个包中，按照NLP任务归类。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "4M7ka0K5OMWU",
    "outputId": "9a1dc26a-786a-4dce-c013-7ae5017a8805"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'OPEN_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH': 'https://file.hankcs.com/hanlp/mtl/open_tok_pos_ner_srl_dep_sdp_con_electra_small_20201223_035557.zip',\n",
       " 'OPEN_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH': 'https://file.hankcs.com/hanlp/mtl/open_tok_pos_ner_srl_dep_sdp_con_electra_base_20201223_201906.zip',\n",
       " 'CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH': 'https://file.hankcs.com/hanlp/mtl/close_tok_pos_ner_srl_dep_sdp_con_electra_small_20210111_124159.zip',\n",
       " 'CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH': 'https://file.hankcs.com/hanlp/mtl/close_tok_pos_ner_srl_dep_sdp_con_electra_base_20210111_124519.zip',\n",
       " 'CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ERNIE_GRAM_ZH': 'https://file.hankcs.com/hanlp/mtl/close_tok_pos_ner_srl_dep_sdp_con_ernie_gram_base_aug_20210904_145403.zip',\n",
       " 'UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_MT5_SMALL': 'https://file.hankcs.com/hanlp/mtl/ud_ontonotes_tok_pos_lem_fea_ner_srl_dep_sdp_con_mt5_small_20210228_123458.zip',\n",
       " 'UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE': 'https://file.hankcs.com/hanlp/mtl/ud_ontonotes_tok_pos_lem_fea_ner_srl_dep_sdp_con_xlm_base_20210602_211620.zip',\n",
       " 'NPCMJ_UD_KYOTO_TOK_POS_CON_BERT_BASE_CHAR_JA': 'https://file.hankcs.com/hanlp/mtl/npcmj_ud_kyoto_tok_pos_ner_dep_con_srl_bert_base_char_ja_20210914_133742.zip'}"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import hanlp\n",
    "hanlp.pretrained.mtl.ALL # MTL多任务，具体任务见模型名称，语种见名称最后一个字段或相应语料库"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BMW528wGNulM"
   },
   "source": [
    "调用`hanlp.load`进行加载，模型会自动下载到本地缓存。自然语言处理分为许多任务，分词只是最初级的一个。与其每个任务单独创建一个模型，不如利用HanLP的联合模型一次性完成多个任务："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "0tmKBu7sNAXX",
    "outputId": "e0187328-c6d2-47fe-cf84-c5b44703940b"
   },
   "outputs": [],
   "source": [
    "HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "elA_UyssOut_"
   },
   "source": [
    "## 分词\n",
    "任务越少，速度越快。如指定仅执行分词，默认细粒度："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "id": "BqEmDMGGOtk3",
    "outputId": "387cbf30-4d70-44b1-d64b-b7a5c22ae31e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "阿婆主 来到 北京 立方庭 参观 自然 语义 科技 公司 。\n"
     ]
    }
   ],
   "source": [
    "HanLP('阿婆主来到北京立方庭参观自然语义科技公司。', tasks='tok').pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jj1Jk-2sPHYx"
   },
   "source": [
    "执行粗颗粒度分词："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 35
    },
    "id": "1goEC7znPNkI",
    "outputId": "ddf15a17-2f5d-4bc3-d145-908fb6176552"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "阿婆主 来到 北京立方庭 参观 自然语义科技公司 。\n"
     ]
    }
   ],
   "source": [
    "HanLP('阿婆主来到北京立方庭参观自然语义科技公司。', tasks='tok/coarse').pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wxctCigrTKu-"
   },
   "source": [
    "同时执行细粒度和粗粒度分词："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Zo08uquCTFSk",
    "outputId": "bf24a01a-a09b-4b78-fdec-2bb705b4becb"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'tok/fine': ['阿婆主', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司', '。'],\n",
       " 'tok/coarse': ['阿婆主', '来到', '北京立方庭', '参观', '自然语义科技公司', '。']}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "HanLP('阿婆主来到北京立方庭参观自然语义科技公司。', tasks='tok*')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`coarse`为粗分，`fine`为细分。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 注意\n",
    "Native API的输入单位限定为句子，需使用[多语种分句模型](https://github.com/hankcs/HanLP/blob/master/plugins/hanlp_demo/hanlp_demo/sent_split.py)或[基于规则的分句函数](https://github.com/hankcs/HanLP/blob/master/hanlp/utils/rules.py#L19)先行分句。RESTful同时支持全文、句子、已分词的句子。除此之外，RESTful和native两种API的语义设计完全一致，用户可以无缝互换。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "suUL042zPpLj"
   },
   "source": [
    "## 自定义词典\n",
    "自定义词典为分词任务的成员变量，要操作自定义词典，先获取分词任务，以细分标准为例："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "AzYShIssP6kq",
    "outputId": "7f07897c-8a97-4193-855d-d9e296581d0c"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<hanlp.components.mtl.tasks.tok.tag_tok.TaggingTokenization at 0x1527337f0>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok = HanLP['tok/fine']\n",
    "tok"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "自定义词典为分词任务的成员变量："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "id": "1q4MUpgVQNlu",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(None, None)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok.dict_combine, tok.dict_force"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "2zZkH9tRQOoi",
    "outputId": "c231c35b-1a5f-4b54-e5c3-8680d2cc1515",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "HanLP支持合并和强制两种优先级的自定义词典，以满足不同场景的需求。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "F-9gAeIVQUFG",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "不挂词典："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "F8M8cyBrQduw",
    "outputId": "c3bf7ec5-b1d4-4207-a979-2c85754c7cd7",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "商品 和 服务 项目\n"
     ]
    }
   ],
   "source": [
    "tok.dict_force = tok.dict_combine = None\n",
    "HanLP(\"商品和服务项目\", tasks='tok/fine').pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DDqQxqQaTayv",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### 强制模式\n",
    "强制模式优先输出正向最长匹配到的自定义词条（慎用，详见[《自然语言处理入门》](http://nlp.hankcs.com/book.php)第二章）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "bjnEqDaATdVr",
    "outputId": "3a282acc-5716-45e4-e1e2-96eefb8ee342",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "商品 和服 务 项目\n"
     ]
    }
   ],
   "source": [
    "tok.dict_force = {'和服', '服务项目'}\n",
    "HanLP(\"商品和服务项目\", tasks='tok/fine').pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ldKAnVoSTgxb",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "与大众的朴素认知不同，词典优先级最高未必是好事，极有可能匹配到不该分出来的自定义词语，导致歧义。自定义词语越长，越不容易发生歧义。这启发我们将强制模式拓展为强制校正功能。\n",
    "\n",
    "强制校正原理相似，但会将匹配到的自定义词条替换为相应的分词结果:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "bwIu0f6wTgbF",
    "outputId": "b941b079-5202-420a-e7f3-8f1617a2545c",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "商品 和 服务 项目\n"
     ]
    }
   ],
   "source": [
    "tok.dict_force = {'和服务': ['和', '服务']}\n",
    "HanLP(\"商品和服务项目\", tasks='tok/fine').pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 合并模式\n",
    "合并模式的优先级低于统计模型，即`dict_combine`会在统计模型的分词结果上执行最长匹配并合并匹配到的词条。一般情况下，推荐使用该模式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "商品 和 服务项目\n"
     ]
    }
   ],
   "source": [
    "tok.dict_force = None\n",
    "tok.dict_combine = {'和服', '服务项目'}\n",
    "HanLP(\"商品和服务项目\", tasks='tok/fine').pretty_print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9aRzEeRvTlRr"
   },
   "source": [
    "需要算法基础才能理解，初学者可参考[《自然语言处理入门》](http://nlp.hankcs.com/book.php)。\n",
    "#### 空格单词"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "含有空格、制表符等（Transformer tokenizer去掉的字符）的词语需要用`tuple`的形式提供："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['如何', '评价', 'iPad Pro', '？', 'iPad  Pro', '有', '2个空格']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tok.dict_combine = {('iPad', 'Pro'), '2个空格'}\n",
    "HanLP(\"如何评价iPad Pro ？iPad  Pro有2个空格\", tasks='tok/fine')['tok/fine']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "聪明的用户请继续阅读，`tuple`词典中的字符串其实等价于该字符串的所有可能的切分方式："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys([('2', '个', '空格'), ('2', '个', '空', '格'), ('2', '个空', '格'), ('2', '个空格'), ('2个', '空', '格'), ('2个', '空格'), ('2个空格',), ('iPad', 'Pro'), ('2个空', '格')])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dict(tok.dict_combine.config[\"dictionary\"]).keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 单词位置"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "HanLP支持输出每个单词在文本中的原始位置，以便用于搜索引擎等场景。在词法分析中，非语素字符（空格、换行、制表符等）会被剔除，此时需要额外的位置信息才能定位每个单词："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['2021 年', 0, 6], ['HanLPv2.1', 7, 16], ['为', 17, 18], ['生产', 18, 20], ['环境', 20, 22], ['带来', 22, 24], ['次', 24, 25], ['世代', 25, 27], ['最', 27, 28], ['先进', 28, 30], ['的', 30, 31], ['多', 31, 32], ['语种', 32, 34], ['NLP', 34, 37], ['技术', 37, 39], ['。', 39, 40]]\n"
     ]
    }
   ],
   "source": [
    "tok.config.output_spans = True\n",
    "sent = '2021 年\\nHanLPv2.1 为生产环境带来次世代最先进的多语种NLP技术。'\n",
    "word_offsets = HanLP(sent, tasks='tok/fine')['tok/fine']\n",
    "print(word_offsets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "返回格式为三元组（单词，单词的起始下标，单词的终止下标），下标以字符级别计量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "for word, begin, end in word_offsets:\n",
    "    assert word == sent[begin:end]"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "authorship_tag": "ABX9TyNRpO7rdchCK1UmB0nQmPrG",
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "tok_mtl.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}